Model Assessment with K-Fold Cross Validation

Harry Snart, SAS Institute

October 2024

This document shows how k-fold cross validation can be used to assess model goodness of fit when few holdout samples are available. We start by loading the HMEQ dataset, which has the binary target BAD. After a brief exploratory analysis we oversample the event class and partition the data into Train, Test and Validate sets. We then train a logistic regression model with stepwise selection, apply k-fold sampling to the holdout dataset, and score each fold in order to generate a distribution of model assessment statistics.

Load Dataset

Here we load the dataset with PROC IMPORT and then print the first few rows with PROC PRINT.
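The load step might look like the following sketch; the file path and the CASUSER caslib are assumptions, not taken from the original program.

```sas
/* Sketch only: adjust the path and caslib to your environment */
proc import datafile="/path/to/hmeq.csv"
    out=casuser.hmeq dbms=csv replace;
    guessingrows=max;   /* scan all rows so column types are inferred correctly */
run;

/* Print a small sample to verify the load */
proc print data=casuser.hmeq(obs=5);
run;
```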


Sample of HMEQ Dataset

BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
1 1100 25860 39025 HomeImp Other 10.5 0 0 94.366666667 1 9 .
1 1300 70053 68400 HomeImp Other 7 0 2 121.83333333 0 14 .
1 1500 13500 16700 HomeImp Other 4 0 0 149.46666667 1 10 .
1 1500 . . . . . . . . . . .
0 1700 97800 112000 HomeImp Office 3 0 0 93.333333333 0 14 .

Exploratory Data Analysis

Here we perform exploratory data analysis, including variable correlation with PROC CORR, variable summaries with PROC CARDINALITY, and visual analysis with PROC SGPLOT.
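A minimal sketch of these three steps is shown below; table and output names are assumptions.

```sas
/* Pairwise correlation of the numeric inputs with the target */
proc corr data=casuser.hmeq;
    var BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC;
run;

/* Variable summary: levels, missing counts, means (SAS Viya) */
proc cardinality data=casuser.hmeq outcard=casuser.varsummary;
run;

/* Visual check of the target distribution */
proc sgplot data=casuser.hmeq;
    vbar BAD;
run;
```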


Check for class imbalance

There is a class imbalance in the dataset, which can be addressed with oversampling.

BAD Number of Observations
0 4771
1 1189

Variable Summary

Variable name Type of the raw values Number of levels Number of observations Number of missing values Mean Standard deviation
LOAN N 20 1189 0 16922.119428 11418.455152
MORTDUE N 20 1189 106 69460.452973 47588.194467
VALUE N 20 1189 105 98172.846227 74339.822506
REASON C 2 1189 48 . .
JOB C 6 1189 23 . .
YOJ N 20 1189 65 8.0278024911 7.1007348316
DEROG N 11 1189 87 0.7078039927 1.468380909
DELINQ N 14 1189 72 1.2291853178 1.9029614156
CLAGE N 20 1189 78 150.19018341 84.952286255
NINQ N 16 1189 75 1.7827648115 2.2469764219
CLNO N 20 1189 53 21.211267606 11.81298083
DEBTINC N 20 1189 786 39.387644892 17.723586299

Variable Correlation

Variables such as loan amount appear to have explanatory power for BAD.

11 Variables: BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
[Scatter plot matrix of BAD, LOAN, MORTDUE, VALUE and YOJ]

[Bar chart: levels of character variable REASON, grouped by BAD]

[Bar chart: levels of character variable JOB, grouped by BAD]

Perform oversampling of event class

Given that the exploratory analysis shows a class imbalance, here we oversample the event class (BAD = 1) using PROC PARTITION.
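A sketch of the oversampling step and the subsequent stratified Train/Test/Validate split is below. The option names mirror the Sampling action set (SAMPPCTEVT, EVENTPROP) and should be checked against the current PROC PARTITION documentation; percentages are inferred from the output tables that follow.

```sas
/* Oversample: draw ~90% of the events and balance classes 50/50.
   Option names are from memory; verify against the PROC PARTITION docs. */
proc partition data=casuser.hmeq event="1" samppctevt=90 eventprop=0.5 seed=42;
    by BAD;
    output out=casuser.samples copyvars=(_ALL_);
run;

/* Stratified split: 50% Train, 25% Test, remainder Validate */
proc partition data=casuser.samples samppct=50 samppct2=25 seed=42 partind;
    by BAD;
    output out=casuser.hmeq_part copyvars=(_ALL_);
run;
```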



The PARTITION Procedure

Oversampling Frequency
Index BAD Number of Obs Number of Samples
0 0 4771 1070
1 1 1189 1070

Output CAS Tables
CAS Library Name Number of Rows Number of Columns
CASUSER(sukhsn) SAMPLES 2140 14

[Bar chart: frequency of BAD in the oversampled group]


The PARTITION Procedure

Stratified Sampling Frequency
Index BAD Number of Obs Sample Size 1 Sample Size 2
0 0 1070 535 268
1 1 1070 535 268

Output CAS Tables
CAS Library Name Number of Rows Number of Columns
CASUSER(sukhsn) HMEQ_PART 2140 15

[Bar chart: number of observations by partition (Train, Test, Validate), grouped by BAD]

Create Logistic Regression Model

Here we fit a stepwise logistic regression on the Train and Test partitions with PROC LOGSELECT. The procedure prints summary statistics for both partitions.

We also save the scoring code to a SAS file that we can then use to score the kfold partitions later.
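The model fit might be sketched as follows. The input table, partition-indicator values and score-code path are assumptions; the selection criteria match the Selection Information table below.

```sas
/* Stepwise logistic regression with SBC for select/choose/stop.
   The PARTITION statement mapping is an assumption: adjust the role
   variable and values to match your partition indicator. */
proc logselect data=casuser.train_test;
    class REASON JOB;
    model BAD(event="1") = LOAN MORTDUE VALUE REASON JOB YOJ DEROG
                           DELINQ CLAGE NINQ CLNO DEBTINC;
    selection method=stepwise(select=sbc choose=sbc stop=sbc);
    partition rolevar=PartName(train="Train" test="Test");
    /* Save DATA step score code for re-use on the k-fold partitions */
    code file="/path/to/logselect_score.sas";
run;
```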


The LOGSELECT Procedure

Model Information
Data Source TRAIN_TEST
Response Variable BAD
Distribution Binary
Link Function Logit
Optimization Technique Newton-Raphson with Ridging
Predicted Response P_BAD
Predicted Response Level I_BAD
Number of Observations
Description Total Training Testing
Number of Observations Read 1606 1070 536
Number of Observations Used 708 472 236

Response Profile
Ordered Value BAD Total Frequency Training Testing
1 0 504 334 170
2 1 204 138 66

Probability modeled is BAD = 1.

Class Level Information
Class Levels Values
REASON 2 DebtCon HomeImp
JOB 6 Mgr Office Other ProfEx Sales Self
Selection Information
Selection Method Stepwise
Select Criterion SBC
Choose Criterion SBC
Stop Criterion SBC
Effect Hierarchy Enforced None
Stop Horizon 3

Selection Details

Convergence criterion (GCONV=1E-8) satisfied.
Selection Summary
Step Effect Entered Number Effects In SBC
0 Intercept 1 576.5809
1 DELINQ 2 529.4067
2 DEBTINC 3 502.9955
3 DEROG 4 485.5381*
* Optimal Value Of Criterion
Stepwise selection stopped because adding or removing an effect does not improve the SBC criterion.
The model at step 3 is selected where SBC is 485.5381.
Selected Effects: Intercept DEROG DELINQ DEBTINC

Selected Model

Dimensions
Columns in Design 4
Number of Effects 4
Max Effect Columns 1
Rank of Design 4
Parameters in Optimization 4
Testing Global Null Hypothesis: BETA=0
Test DF Chi-Square Pr > ChiSq
Likelihood Ratio 3 107.3951 <.0001
Fit Statistics
Description Training Testing
-2 Log Likelihood 463.02883 251.09410
AIC (smaller is better) 471.02883 259.09410
AICC (smaller is better) 471.11448 259.26726
SBC (smaller is better) 487.65675 272.94943
Average Square Error 0.15734 0.17402
-2 Log L (Intercept-only) 570.42396 279.72272
R-Square 0.20350 0.11424
Max-rescaled R-Square 0.29015 0.16453
McFadden's R-Square 0.18827 0.10235
Misclassification Rate 0.22669 0.24576
Difference of Means 0.23429 0.14475
Parameter Estimates
Parameter DF Estimate Standard Error Chi-Square Pr > ChiSq
Intercept 1 -4.270484 0.572775 55.5887 <.0001
DEROG 1 0.619086 0.183007 11.4437 0.0007
DELINQ 1 0.650306 0.121775 28.5179 <.0001
DEBTINC 1 0.080297 0.014990 28.6929 <.0001
Task Timing
Task Seconds Percent
Setup and Parsing 0.00 9.30%
Levelization 0.00 2.27%
Model Initialization 0.00 0.85%
SSCP Computation 0.00 6.58%
Model Selection 0.03 77.60%
Producing Score Code 0.00 2.33%
Display 0.00 0.79%
Cleanup 0.00 0.01%
Total 0.04 100.00%

Visualise Model Fit on Test Dataset

Here we score the Test dataset using the saved DATA step score code and visualise the ROC, lift and response charts.
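A sketch of the scoring and assessment steps is below. The score-code path and the ROCOUT table name are assumptions; PROC ASSESS option and output names should be verified against the documentation.

```sas
/* Apply the saved score code to the Test partition */
data casuser.test_scored;
    set casuser.hmeq_part(where=(PartName="Test"));
    %include "/path/to/logselect_score.sas";
run;

/* Compute ROC and lift statistics for the scored Test data (Viya) */
proc assess data=casuser.test_scored rocout=casuser.test_roc;
    var P_BAD;
    target BAD / level=nominal event="1";
run;
```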


[ROC Curve (Target = BAD, Event = 1), AUC = 0.73]

[Lift Chart (Target = BAD, Event = 1)]

[Cumulative Lift Chart (Target = BAD, Event = 1)]

[Cumulative Response Rate (Target = BAD, Event = 1)]

Perform K-Fold Cross Validation

Here we define a macro, kFoldCV, which uses the CAS Sampling action set to perform k-fold partitioning stratified by BAD. We then score each fold and append the results to a single table, including a partition identifier. Finally, we run PROC ASSESS, which computes model assessment statistics by k-fold partition.
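The macro might be sketched roughly as follows. The kFold action name, its parameters, the fold-identifier column and the score-code path are all assumptions from memory and should be checked against the Sampling action set documentation before use.

```sas
/* Sketch of kFoldCV: partition the holdout, score every fold,
   then assess by fold. Names below are assumptions. */
%macro kFoldCV(k=5);
    proc cas;
        sampling.kFold /
            table={name="validate", groupBy={"BAD"}},  /* stratify by BAD */
            k=&k., seed=42,
            output={casOut={name="kfolds", replace=true}, copyVars="ALL"};
    quit;

    /* Score all folds with the saved DATA step score code */
    data casuser.kfold_scored;
        set casuser.kfolds;
        %include "/path/to/logselect_score.sas";
    run;

    /* Assess each fold separately to get a distribution of statistics */
    proc assess data=casuser.kfold_scored rocout=casuser.roc_by_fold;
        by _fold_;           /* fold identifier column is an assumption */
        var P_BAD;
        target BAD / level=nominal event="1";
    run;
%mend kFoldCV;

%kFoldCV(k=10);
```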

Visualise Estimated Fit Statistics by Kfold

Here we retain only the values at the 0.5 cutoff from the ROC output and visualise the estimated distributions of KS, accuracy, F1, AUC, Gini and misclassification rate across the k-fold partitions.
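The filtering and one of the distribution plots might look like the sketch below; the ROC output column names (_Cutoff_, _C_) are assumptions based on typical PROC ASSESS output.

```sas
/* Keep only the 0.5-cutoff row from each fold's ROC output */
data work.fold_stats;
    set casuser.roc_by_fold;
    where round(_Cutoff_, 0.001) = 0.5;
run;

/* Estimated distribution of AUC (C statistic) across folds */
proc sgplot data=work.fold_stats;
    histogram _C_;
    density _C_ / type=kernel;
run;
```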
